The midterm attempts to bring together everything you have learned to date. You’ll be asked to replicate a series of graphics to demonstrate your skills and provide short descriptions/explanations regarding issues and concepts in ggplot2.
You are free to use any resource at your disposal such as notes, past labs, the internet, fellow students, instructor, TA, etc. However, do not simply copy and paste solutions. This is a chance for you to assess how much you have learned and determine if you are developing practical data visualization skills and knowledge.
Datasets
The datasets used for this dataset are stephen_curry_shotdata_2014_15.txt, ga_election_data.csv, and ga_map.rda. May also need the nbahalfcourt.jpg image.
Below you can find a short description of the variables contained in stephen_curry_shotdata_2014_15.txt:
GAME_ID - Unique ID for each game during the season
PLAYER_ID - Unique player ID
PLAYER_NAME - Player’s name
TEAM_ID - Unique team ID
TEAM_NAME - Team name
PERIOD - Quarter or period of the game
MINUTES_REMAINING - Minutes remaining in quarter/period
SECONDS_REMAINING - Seconds remaining in quarter/period
EVENT_TYPE - Missed Shot or Made Shot
SHOT_DISTANCE - Shot distance in feet
LOC_X - X location of shot attempt according to tracking system
LOC_Y - Y location of shot attempt according to tracking system
The ga_election_data.csv dataset contains the state of Georgia’s county level results for the 2020 US presidential election. Here is a short description of the variables it contains:
County - name of county in Georgia
Candidate - name of candidate on the ballot,
Election Day Votes - number of votes cast on election day for a candidate within a county
Absentee by Mail Votes - number of votes cast absentee by mail, pre-election day, for a candidate within a county
Advanced Voting Votes - number of votes cast in-person, pre-election day, for a candidate within a county
Provisional Votes - number of votes cast on election day for a candidate within a county needing voter eligibility verification
Total Votes - total number of votes for a candidate within a county
We have also included the map data for Georgia (ga_map.rda) which was retrieved using tigris::counties().
Using the stephen_curry_shotdata_2014_15.txt dataset replicate, as close as possible, the graphics below (2 required, 1 optional/bonus). After replicating the graphics provide a summary of what the graphics indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate and how they relate/compare across distance and game time (i.e. across quarters/periods).
Plot 1
Hints:
Figure width 6 inches and height 4 inches, which is taken care of in code chunk yaml with fig-width and fig-height
Use minimal theme and adjust from there
While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 12 & 14
Figure width 6 inches and height 4 inches, which is taken care of in code chunk yaml with fig-width and fig-height
Use minimal theme and adjust from there
Useful hex colors: "#5D3A9B" and "#E66100"
No padding on vertical axis
Transparency is being used
annotate() is used to add labels
While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0, 0.04, 0.07, 0.081, 0.25, 3, 12, 14, 27
Plot 3 is required for graduate students, but is optional for undergraduate students.
Hints:
Figure width 7 inches and height 7 inches, which is taken care of in code chunk yaml with fig-width and fig-height
Colors used: "grey", "red", "orange""yellow" (don’t have to use "orange", you can get away with using only "red" and "yellow")
To top code so 15+ is the highest value, you need to set the limits in the appropriate scale while also also setting the na.value to the top color
While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0, 0.7, 5, 12, 14, 15, 20
basecourt +geom_hex(data = shots,aes(loc_x, loc_y),bins =20,alpha =0.7,color ="grey" ) +scale_fill_gradient(name ="Shot\nAttempts",low ="yellow",high ="red",breaks =c(0, 5, 10, 15),labels =c(0, 5, 10, "15+"),# set limit of shots to 0 to 15limits =c(0, 15),# any shots above 15 will be 'na' so set color to high value = redna.value ="red" ) +labs(subtitle ="Shot Chart (2014-2015)",title ="Stephen Curry" ) +theme_void() +theme(plot.title =element_text(face ="bold", size =14))
Summary
Provide a summary of what the graphics above indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate and how they relate/compare across distance and game time (i.e. across quarters/periods).
In plot 1: Unsurprisingly, there are the fewest shots during OT for both missed and made- which makes sense since not every game has an OT period. For all periods, the made shots for each quarter each have a longer IQR and lower median than the missed shots. And in plot 2:The density graph shows a similar trend from the box plots: made shots on average have a shorter distance than missed shots. These densities are both bi-modal, peaking around 5f and again around 25 ft. The made shots has the higher pick at 5ft (easier to make shots at a shorter distance) and the missed shots has a higher peak at 25 ft (harder to make shots at a longer distance.In plot 3: Stephen Curry made most of his shots in front of the basket, where his made-shots is a lot more than the missed-shots. And he also made most of his shots 25fts away from the baskets, where he made more missed-shots. These are the spots where most basketball player would do their shots.
Exercise 2
Using the ga_election_data.csv dataset in conjunction with mapping data ga_map.rda replicate, as close as possible, the graphic below. Note the graphic is comprised of two plots displayed side-by-side. The plots both use the same shading scheme (i.e. scale limits and fill options).
Background Information: Holding the 2020 US Presidential election during the COVID-19 pandemic was a massive logistical undertaking. Voter engagement was extremely high which produced a historical high voting rate. Voting operations, headed by states, ran very monthly and encountered few COVID-19 related issues. The state of Georgia did a particularly good job at this by encouraging their residents to use early voting. About 75% of the vote in a typical county voted early! Ignoring county boundaries, about 4 in every 5 voters, 80%, in Georgia voted early.
While it is clear that early voting was the preferred option for Georgia voters, we want to investigate whether or not one candidate/party utilized early voting more than the other — we are focusing on the two major candidates. We created the graphic below to explore the relationship of voting mode and voter preference, which you are tasked with recreating.
After replicating the graphic provide a summary of how the two maps relate to one another. That is, what insight can we learn from the graphic.
Hints:
Figure width 7 inches and height 7 inches, which is taken care of in code chunk yaml with fig-width and fig-height
Make two plots, then arrange plots accordingly using patchwork package
patchwork::plot_annotation() will be useful for adding graphic title and caption; you’ll also set the theme options for the graphic title and caption (think font size and face)
ggthemes::theme_map() was used as the base theme for the plots
scale_*_gradient2() will be helpful
Useful hex colors: "#5D3A9B" and "#1AFF1A"
While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0.5, 0.75, 1, 10, 12, 14, 24
Important
Add comments to the code below where indicated. The added comments should concisely describe what the following line(s) of code do in the data wrangling process
Code
# dataga_graph <- ga_election_dat %>%# get the proportion of absentee and advanced votes out of total votesmutate(prop_pre_eday = (absentee_by_mail_votes + advanced_voting_votes) / total_votes ) %>%# Get rid of all the -vote variablesselect(-contains("_vote"))# biden map databiden_map_data <- ga_map %>%# combine two tables together on the same country and the same candidatesleft_join( ga_graph %>%filter(candidate =="Joseph R. Biden"),by =c("name"="county") )# trump map datatrump_map_data <- ga_map %>%# combine two tables together on the same country and the same candidatesleft_join( ga_graph %>%filter(candidate =="Donald J. Trump"),by =c("name"="county") )# biden plotbiden <-ggplot(data = biden_map_data) +geom_sf(aes(fill = prop_pre_eday), show.legend =FALSE) +scale_fill_gradient2(high ="#5D3A9B",low ="#1AFF1A",midpoint =0.75,limits =c(0.5, 1),# guide = "none" ) +theme_map() +ggtitle("Joseph R. Biden") +labs(subtitle ="Democratic Nominee")+theme(plot.title =element_text(face ="bold", size =14) )# trump plottrump <-ggplot(data = trump_map_data) +geom_sf(aes(fill = prop_pre_eday)) +scale_fill_gradient2(name =NULL,high ="#5D3A9B",low ="#1AFF1A",midpoint =0.75,limits =c(0.5, 1),# guide = "none",breaks =c(0.5, 0.75, 1),labels = scales::percent_format() ) +theme_map() +labs(title ="Donald J. Trump",subtitle ="Republican Nominee") +theme(plot.title =element_text(face ="bold", size =14),legend.justification =c(1, 1),legend.position =c(1, 1) )# final plot(biden + trump) +plot_annotation(title ="Percentage of votes from early voting",caption ="Georgia:2021 US Presidents Election Results",theme =theme(plot.title =element_text(face ="bold", size =24),plot.caption =element_text(size =12) ) )
Summary
Provide a summary of how the two maps relate to one another. That is, what insight can we learn from the graphic.
These two maps are have the opposite information. If one area in graph A shows a high percentage, then the same area in the other graph tends to show a low percentage.
Exercise 3
Part 1
In 3-5 sentences, describe the core concept/idea and structure of the ggplot2 package.
ggplot2 separates data, the mapping of data to graphic elements, and the drawing of graphic elements that are not related to data, similar to the MVC framework in java. This allows ggplot2 users to clearly feel the real components of a data analysis graph, and to develop and adjust accordingly.
Part 2
Describe each of the following:
ggplot():creates a new plot; this plot is blank without any other specifications
aes() : maps data to visual aesthetics
geoms:specifies the type of geometric object to visualize
stats:statistical transformation on the data rather than visualizing the data
scales: translation of data to visual properties
theme():visual appearances that are not related to the data
Part 3
Explain the difference between using this code geom_point(aes(color = VARIABLE)) as opposed to using geom_point(color = VARIABLE).
geom_point(aes(color = VARIABLE)): This maps data to color based on value(s) of VARIABLE, e.g. if there are three levels of VARIABLE, the data will be mapped to three different color based on the level. This option also automatically creates a legend to show which color is assigned to which level(s) of VARIABLE.
geom_point(color = VARIABLE): This makes all points one color called VARIABLE. No legend is created because there is no mapping from data to aesthetics.